-
Notifications
You must be signed in to change notification settings - Fork 15
Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr #571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v1
Are you sure you want to change the base?
Conversation
…alidate* functions
if isinstance(data_type, ScalarType): | ||
return fill_value_map.get(data_type) | ||
if isinstance(data_type, StructuredType): | ||
return "AAAAAAAAAAAAAAAA" # BUG: this does not work!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/TGSAI/mdio-cpp/blob/main/mdio/dataset_factory.h#L306C5-L317C6
Here's the implementation in C++.
I think this may handle what we need: https://github.com/zarr-developers/zarr-python/blob/ea4d7e96c0738526bbf08bb99ed0ea23a6081836/src/zarr/core/dtype/npy/common.py#L241-L258
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the encoding by using zarr function. The fill value of structured types are all base64 encoded and the zarr function @BrianMichell pointed at handles this iirc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BrianMichell , @tasansal
After Googling, i believe the proper return value should be:
if isinstance(data_type, StructuredType):
return tuple(fill_value_map.get(field.format) for field in data_type.fields)
However, this does not seem to work and I think, it is a bug in Zarr:
For the StructuredType datatype:
dtype([('cdp-x', '<i4'), ('cdp-y', '<i4'), ('elevation', '<f2'), ('some_scalar', '<f2')])
I specify fill_value as
(2147483647, 2147483647, nan, nan)
I expect Zarr to do the fill_value serialization to JSON
However, I see the fill_value stored in .zmetadata as
"fill_value": null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! i put in lots of nitpick
if isinstance(data_type, ScalarType): | ||
return fill_value_map.get(data_type) | ||
if isinstance(data_type, StructuredType): | ||
return "AAAAAAAAAAAAAAAA" # BUG: this does not work!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the encoding by using zarr function. The fill value of structured types are all base64 encoded and the zarr function @BrianMichell pointed at handles this iirc.
# Let's store the data array for the second pass | ||
data_arrays[v.name] = data_array | ||
|
||
# Second pass: Add non-dimension coordinates to the data arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tasansal This was actually a hint for a discussion :-)
Just by looking at an example at https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#compressors
I was hoping that the following would work out of the box, but it does not.
"compressor": _to_dictionary(v.compressor)
I see the following error:
>_compressor = parse_compressor(compressor[0])
> return numcodecs.get_codec(data)
E - numcodecs.errors.UnknownCodecError: codec not available: 'None'"
Thus, we have to have a conversion function. Shouldn't this be addressed at the schema layer?
It would be nice to have the schema that can be serialized to JSON dictionary that can be used directly to parametrize the encoder.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## v1 #571 +/- ##
==========================================
+ Coverage 84.32% 90.44% +6.12%
==========================================
Files 46 72 +26
Lines 2194 3895 +1701
Branches 305 278 -27
==========================================
+ Hits 1850 3523 +1673
- Misses 301 302 +1
- Partials 43 70 +27 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Outstanding work